by Sam Morrow
## X fixed.acidity volatile.acidity citric.acid
## Min. : 2 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1227 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2451 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2452 Mean : 6.854 Mean :0.2781 Mean :0.3334
## 3rd Qu.:3676 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3800
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.000 Median :0.04300 Median : 34.00
## Mean : 6.148 Mean :0.04568 Mean : 35.22
## 3rd Qu.: 9.600 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :17.950 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.72 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.09 1st Qu.:0.4100
## Median :133.0 Median :0.9937 Median :3.18 Median :0.4700
## Mean :137.9 Mean :0.9939 Mean :3.19 Mean :0.4902
## 3rd Qu.:167.0 3rd Qu.:0.9959 3rd Qu.:3.28 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0024 Max. :3.82 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.53 Mean :5.885
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.25453969 0.002809863
## fixed.acidity -0.254539689 1.00000000 -0.024941981
## volatile.acidity 0.002809863 -0.02494198 1.000000000
## citric.acid -0.146465102 0.28743510 -0.161408157
## residual.sugar 0.012545851 0.08951534 0.050190875
## chlorides -0.048043133 0.02434731 0.071430730
## free.sulfur.dioxide -0.008737141 -0.05059463 -0.093568972
## total.sulfur.dioxide -0.161283394 0.08974740 0.092707018
## density -0.194756843 0.27568774 0.006473784
## pH -0.119515777 -0.42827622 -0.034870905
## sulphates 0.007607229 -0.01611827 -0.038408212
## alcohol 0.214087328 -0.12297629 0.063132761
## quality 0.032062686 -0.11376730 -0.198627115
## citric.acid residual.sugar chlorides
## X -0.146465102 0.01254585 -0.04804313
## fixed.acidity 0.287435102 0.08951534 0.02434731
## volatile.acidity -0.161408157 0.05019088 0.07143073
## citric.acid 1.000000000 0.08135806 0.11810290
## residual.sugar 0.081358056 1.00000000 0.08170230
## chlorides 0.118102903 0.08170230 1.00000000
## free.sulfur.dioxide 0.091529288 0.31502227 0.10176926
## total.sulfur.dioxide 0.116471355 0.40652142 0.19885721
## density 0.144726507 0.82040112 0.26263441
## pH -0.165932129 -0.19055836 -0.09076344
## sulphates 0.064584287 -0.02303271 0.01656835
## alcohol -0.079091027 -0.45178076 -0.36152981
## quality -0.006495807 -0.08415482 -0.20784465
## free.sulfur.dioxide total.sulfur.dioxide density
## X -0.008737141 -0.16128339 -0.194756843
## fixed.acidity -0.050594628 0.08974740 0.275687737
## volatile.acidity -0.093568972 0.09270702 0.006473784
## citric.acid 0.091529288 0.11647136 0.144726507
## residual.sugar 0.315022273 0.40652142 0.820401125
## chlorides 0.101769259 0.19885721 0.262634409
## free.sulfur.dioxide 1.000000000 0.61377314 0.308932572
## total.sulfur.dioxide 0.613773142 1.00000000 0.543326254
## density 0.308932572 0.54332625 1.000000000
## pH 0.003866531 0.01068473 -0.086258076
## sulphates 0.059736861 0.13396328 0.081269157
## alcohol -0.247918033 -0.44449032 -0.807752697
## quality 0.011901107 -0.17134265 -0.312454677
## pH sulphates alcohol quality
## X -0.119515777 0.007607229 0.21408733 0.032062686
## fixed.acidity -0.428276224 -0.016118267 -0.12297629 -0.113767296
## volatile.acidity -0.034870905 -0.038408212 0.06313276 -0.198627115
## citric.acid -0.165932129 0.064584287 -0.07909103 -0.006495807
## residual.sugar -0.190558361 -0.023032706 -0.45178076 -0.084154817
## chlorides -0.090763444 0.016568351 -0.36152981 -0.207844646
## free.sulfur.dioxide 0.003866531 0.059736861 -0.24791803 0.011901107
## total.sulfur.dioxide 0.010684731 0.133963276 -0.44449032 -0.171342647
## density -0.086258076 0.081269157 -0.80775270 -0.312454677
## pH 1.000000000 0.158215782 0.11472929 0.097288873
## sulphates 0.158215782 1.000000000 -0.01838456 0.052596089
## alcohol 0.114729292 -0.018384558 1.00000000 0.433607807
## quality 0.097288873 0.052596089 0.43360781 1.000000000
The data is very clean, and is ordered into rows of single measured observations featuring:

- Fixed Acidity
- Volatile Acidity
- Citric Acid
- Residual Sugar
- Chlorides
- Free Sulfur Dioxide
- Total Sulfur Dioxide
- Density
- pH
- Sulphates
- Alcohol

Each row then has a quality rating, which has been gauged by at least 3 experts.
I’m interested in trying to model wine quality by looking at correlations between quality and other features, so quality is my primary feature of interest. Alcohol has the strongest correlation to quality in the dataset (0.434), so I will have to factor that into my investigations, in spite of my subjective experience that there are quality wines of many strengths. One other thing is clear: no single variable is a strong enough indicator of quality on its own.
I am surprised that density has the second strongest correlation to quality. After some reflection, I think that density may represent the balancing of other features (as sugar content etc. affect this property), which could help in constructing a model. However, I am unconvinced that humans can consciously perceive such minor fluctuations in density.
The other features with a notable correlation to quality (besides Density and Alcohol) are Chlorides, Volatile Acidity and Total Sulfur Dioxide.
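A minimal sketch of how such a ranking can be pulled out of a correlation matrix. The data frame below is a synthetic stand-in with made-up values, not the real wine data, and the names `quality.cor` and `ranked` are only illustrative.

```r
# Toy stand-in for the wine data frame (synthetic values, NOT the real dataset).
set.seed(1)
n <- 200
wine <- data.frame(
  alcohol   = runif(n, 8, 14),
  density   = runif(n, 0.987, 1.002),
  chlorides = runif(n, 0.01, 0.2)
)
wine$quality <- round(3 + 0.4 * (wine$alcohol - 8) + rnorm(n))

# Take the "quality" column of the correlation matrix and drop quality itself.
quality.cor <- cor(wine)[, "quality"]
quality.cor <- quality.cor[names(quality.cor) != "quality"]

# Order by absolute value so strong negative correlations rank highly too.
ranked <- quality.cor[order(abs(quality.cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting on the absolute value matters here, because several of the more interesting relationships with quality are negative.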
I’ve created a boolean column for high and low quality wines, so that I could generalise about the data in different ways.
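Something along these lines would produce that column. The threshold of 5 is an assumption borrowed from the quality > 5 / <= 5 split used when scoring the model, and the column name `high.quality` and the toy scores are hypothetical.

```r
# Toy quality scores (made-up values for illustration only).
wine <- data.frame(quality = c(3, 5, 6, 7, 5, 8, 4, 6))

# TRUE for wines rated above 5, FALSE otherwise.
# The threshold of 5 is an assumption, not confirmed by the original code.
wine$high.quality <- wine$quality > 5

# Count how many wines fall on each side of the split.
table(wine$high.quality)
```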
I found that wine with a residual sugar >= 22 g/l was producing much less predictable results. We do not have many data points above this value, and the wines vary dramatically in residual sugar beyond this point, so I have made an executive decision to filter it out. I looked into wine sweetness, and wine between 18 and 45 g/l is considered to be medium, so I am limiting my analysis to dry and medium-dry wine, using 18 g/l as the cutoff. I think if we were to seriously analyse sweeter wines, we would need more data.
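As a rough sketch, the filter could look like the following. Whether the cutoff is strict (`<`) or inclusive (`<=`) is an assumption on my part, and the values and the names `sugar.cutoff` and `wine.dry` are made up for illustration.

```r
# Toy residual sugar values in g/l (made up for illustration).
wine <- data.frame(residual.sugar = c(1.2, 5.0, 17.95, 18.0, 22.0, 65.8))

# Assumed cutoff: keep only wines strictly below 18 g/l.
sugar.cutoff <- 18
wine.dry <- subset(wine, residual.sugar < sugar.cutoff)

# Report how many rows the filter removed.
rows.dropped <- nrow(wine) - nrow(wine.dry)
print(rows.dropped)
```

Printing the number of dropped rows is a cheap way to check that the filter only removes a small tail of the data.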
The distribution of residual sugar on the histogram is interesting, because it is roughly positively skewed.
Sulphates seem to be bi-modal while alcohol is in a sense multi-modal. My guess is that there are different classes / categories of wine targeting certain areas and that, far from being a coincidence, these values might be indicative of modal wines in different genres (or with different preservative requirements). It would be interesting to have had more categorical data for the wines, to analyse subsets as well as white wine in general.
##
## Call:
## lm(formula = wine$quality ~ wine$density + I(wine$density^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1926 -0.6053 0.0921 0.4150 3.4251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17829 1370 13.01 <2e-16 ***
## wine$density -35754 2756 -12.97 <2e-16 ***
## I(wine$density^2) 17932 1386 12.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8299 on 4810 degrees of freedom
## Multiple R-squared: 0.128, Adjusted R-squared: 0.1276
## F-statistic: 352.9 on 2 and 4810 DF, p-value: < 2.2e-16
##
## Calls:
## m1: lm(formula = wine$quality ~ wine$density + I(wine$density^2))
## m2: lm(formula = wine$quality ~ wine$density + I(wine$density^2) +
## wine$alcohol + I(wine$alcohol^2) + I(wine$alcohol^3) + I(wine$alcohol^4))
## m3: lm(formula = wine$quality ~ wine$density + I(wine$density^2) +
## wine$alcohol + I(wine$alcohol^2) + I(wine$alcohol^3) + I(wine$alcohol^4) +
## wine$residual.sugar)
## m4: lm(formula = wine$quality ~ wine$density + I(wine$density^2) +
## wine$alcohol + I(wine$alcohol^2) + I(wine$alcohol^3) + I(wine$alcohol^4) +
## wine$residual.sugar + wine$volatile.acidity)
##
## ===================================================================================
## m1 m2 m3 m4
## -----------------------------------------------------------------------------------
## (Intercept) 17828.653*** 5674.429*** 3697.429* 439.918
## (1370.216) (1657.444) (1657.013) (1604.950)
## wine$density -35754.517*** -11204.143*** -7134.415* -699.446
## (2756.356) (3336.875) (3337.274) (3231.744)
## I(wine$density^2) 17931.604*** 5646.941*** 3545.262* 316.367
## (1386.178) (1677.259) (1678.273) (1625.165)
## wine$alcohol -38.487* -36.263* -16.768
## (18.657) (18.497) (17.845)
## I(wine$alcohol^2) 4.855 4.666 1.975
## (2.558) (2.536) (2.446)
## I(wine$alcohol^3) -0.265 -0.261 -0.100
## (0.155) (0.154) (0.148)
## I(wine$alcohol^4) 0.005 0.005 0.002
## (0.003) (0.003) (0.003)
## wine$residual.sugar 0.052*** 0.050***
## (0.006) (0.005)
## wine$volatile.acidity -2.181***
## (0.113)
## -----------------------------------------------------------------------------------
## R-squared 0.1 0.2 0.2 0.3
## adj. R-squared 0.1 0.2 0.2 0.3
## sigma 0.8 0.8 0.8 0.8
## F 352.9 202.9 189.1 225.2
## p 0.0 0.0 0.0 0.0
## Log-likelihood -5930.3 -5716.6 -5674.3 -5493.5
## Deviance 3312.6 3031.0 2978.2 2762.7
## AIC 11868.6 11449.1 11366.5 11007.0
## BIC 11894.5 11500.9 11424.8 11071.8
## N 4813 4813 4813 4813
## ===================================================================================
## [1] 0.522221
## [1] 3292
The density appears to decrease with alcohol. Interestingly, there seems to be less variation in the higher alcohol area.
The median alcohol level increases almost linearly with quality from the medium ratings upwards, but goes the other way below that. Perhaps stronger wines are harder to get right, but weaker wines are often quite average. The IQR certainly shows an interesting trend, although there are many outliers, and lots of overlap between the whiskers at each quality.
Many low quality wines have a high density, and most high quality wines have a lower density. The variance is high here, so it’s hard to draw strong conclusions.
There is a negative correlation between pH and fixed acidity. The surprising part for me is how weak a correlation this is. I expected it would be a different measure for effectively the same value.
The amount of chlorides seems to increase with residual sugar (particularly the minimum chlorides). I would hypothesise that perhaps the salt content is increased to balance sweetness. I did not expect much correlation here, so this is interesting.
Density and alcohol have a very strong relationship. I think this must be a yeast / sugar issue, where the stronger the alcohol, the more sugar was consumed by the yeast, and the thinner the wine would be (or, as vintners calculate when making wine, it would have a lower specific gravity). Starting gravity (i.e. initial sugar levels) and different strains of yeast would account for much of the variance, and environmental factors such as fermentation temperature would likely also have an impact, but even with these there is an obvious relationship.
Quality seems very widely distributed when using it to colour alcohol vs. density; in particular, medium quality seems to be extremely varied. There is still a definite section of stronger alcohol and lower density that is the largest area of high ratings.
When we remake the same plot using the quality prediction model outputs, we effectively see a caricature of the effect above: it is too simplistic to genuinely reflect reality, but you can see significant similarities.
Residual sugar and density also seem to produce a very clear distribution, and one that is again very closely matched by the model visually.
Too much total sulfur dioxide seems to reduce the chances of a wine being rated as high quality.
I created a linear model to predict quality using density, alcohol, residual sugar and volatile acidity. Its predictions have a correlation of 0.522 to quality.
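For illustration, here is a sketch of fitting a model with the same terms as `m4` in the output above and correlating its fitted values with quality. It runs on synthetic stand-in data (not the real wine data), and it uses the `data =` argument rather than `wine$` prefixes, which should give the same fit.

```r
# Synthetic stand-in data (made-up values, NOT the real wine dataset).
set.seed(42)
n <- 500
wine <- data.frame(
  density          = runif(n, 0.987, 1.002),
  alcohol          = runif(n, 8, 14),
  residual.sugar   = runif(n, 0.6, 18),
  volatile.acidity = runif(n, 0.08, 1.1)
)
wine$quality <- pmin(9, pmax(3, round(
  6 + 0.3 * (wine$alcohol - 10.5) -
    2 * (wine$volatile.acidity - 0.28) + rnorm(n, sd = 0.8)
)))

# Same terms as m4: quadratic in density, quartic in alcohol,
# plus residual sugar and volatile acidity as linear terms.
m4 <- lm(quality ~ density + I(density^2) +
           alcohol + I(alcohol^2) + I(alcohol^3) + I(alcohol^4) +
           residual.sugar + volatile.acidity,
         data = wine)

# Correlation between fitted values and observed quality.
fit.cor <- cor(fitted(m4), wine$quality)
print(fit.cor)
```

For a least-squares fit with an intercept, the square of this correlation equals the model's R-squared, so the two numbers tell the same story.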
The model for density was a 2 degree polynomial, and for alcohol I used a 4 degree polynomial (maybe overkill / overfitting here). While that enabled more subtle inflections in the resultant quality prediction (and a closer fit), it was still only a moderate increase from the correlation of alcohol on its own (0.434). Ultimately this is a very weak model, and is not well suited to the task. I think it might be possible to improve it, and I should have split the data into random training and test samples, so I could at least have reduced the risk of over-fit.
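A hold-out split of the kind described above might be sketched as follows. The 70/30 ratio is an arbitrary choice, the formula is deliberately simpler than the full model, and the data is synthetic rather than the real wine data.

```r
# Synthetic stand-in data (made-up values, NOT the real wine dataset).
set.seed(7)
n <- 500
wine <- data.frame(
  alcohol = runif(n, 8, 14),
  density = runif(n, 0.987, 1.002)
)
wine$quality <- round(6 + 0.3 * (wine$alcohol - 10.5) + rnorm(n, sd = 0.8))

# Randomly hold out 30% of the rows (the ratio is an arbitrary choice).
train.idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- wine[train.idx, ]
test  <- wine[-train.idx, ]

# Fit on the training rows only.
m <- lm(quality ~ density + I(density^2) + alcohol, data = train)

# Compare in-sample correlation with correlation on the unseen rows.
cor.train   <- cor(fitted(m), train$quality)
cor.holdout <- cor(predict(m, newdata = test), test$quality)
print(c(train = cor.train, holdout = cor.holdout))
```

A hold-out correlation that is much lower than the training one would be a sign that the higher-degree polynomial terms are fitting noise.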
Perhaps the subtlety of a neural network could produce a more nuanced model than a linear one. The subjective rating of wine quality is complex to model.
The main strength of this model is that when reduced to a boolean prediction (of quality > 5 or <= 5) it gets it right for 3292 of the 4813 wines in this dataset. So with ~68% accuracy this model could be used, for example, to help position / price wine in a shop based on the likelihood it would get an above or below average rating.
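The boolean accuracy can be computed roughly like this. The observed and predicted values below are made up for illustration; only the final line uses the real counts quoted above.

```r
# Toy observed scores and model predictions (made up for illustration).
observed  <- c(5, 6, 7, 4, 6, 5, 8, 6)
predicted <- c(5.4, 5.9, 6.3, 5.2, 4.8, 4.9, 6.6, 5.1)

# Reduce both to "above average" (> 5) and count where they agree.
hits <- sum((predicted > 5) == (observed > 5))
accuracy <- hits / length(observed)
print(accuracy)

# The same arithmetic with the counts reported for the real data.
real.accuracy <- round(3292 / 4813, 3)
print(real.accuracy)
```

It is worth comparing this figure against the share of wines that are simply rated above 5, since always guessing the majority class sets the baseline a model has to beat.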
We can clearly see that when we show mean alcohol by approximate quality, over residual sugar, that at the low end of residual sugar, stronger wines are generally of a higher quality. As we go up the residual sugar scale, these averages converge. We can’t read too much into this except to say that as wine gets sweeter, strength becomes less of an indicator of quality. On both lines we see a negative trend for strength and sweetness, but this hides so much of the variance, I’m not sure that’s useful.
The key take-away is that if we are going to make generalisations about wine, we might want to categorise it further. It’s also fair to say that each axis here provides a degree of smoothing, and generalisation is the operative word here. There are many exceptions to this rule.
Strong, less dense wine is more likely to be of a higher quality. That was the main outcome of my research. The indicators aren’t very strong, and there’s a lot of medium quality wine everywhere, which makes it very difficult to make assumptions.
These plots amazed me a little at first. The combined strength of residual sugar and density as predictors for quality seems very strong. We can see a lot of similarity between both the real quality and the predicted quality. There is still a huge amount of noise, and lots of grey points, but we see a clearer picture here.
I think I’ve been able to create an interesting, although flawed model for predicting wine quality. I’ve been able to show that there are some real correlations in the data. I’m happy with those aspects of my analysis.
My biggest challenge was trying to find stronger predictors than any that actually existed in the data. Near 70% accuracy of “is it good?”, “is it bad?” was still a positive outcome, but I imagined that the detailed data on wine chemistry would be enough to overcome the subjectivity of the quality ratings. It seems there is still a little je ne sais quoi in wine quality that goes beyond our dataset.
I also cannot decide if filtering the data was a good step or a bad step. The combination of the data being scant in the high-sugar range and being more extreme there led me to believe it was not worth analysing at this stage. I’m confident it helped me to draw more useful conclusions, so I think it wasn’t simply a case of convenience.
My model was very moderate! I am sad about this. It was not able to predict quality below 4 or above 7. This is in some ways to be expected, but it is very much a symptom of the variability / non-linear nature of the real data.
I think I could have possibly handled the interval quality data better too. Continuous values are easier to model, and a linear model will naturally produce values in between the outputs.